pacman::p_load(sf,sfdep,tidyverse,tmap)Take-Home_Ex1
Analysis of Geospatial Distribution of Bus Ridership by Origin Bus Stop in Singapore
Objectives
The bus system is one of Singapore’s two pillars of public transport aside from the MRT. The bus system ensures convenient and affordable short-, medium-, and long-distance travel for riders. Thanks to the widespread availability of bus stops as compared to MRT stations, it has a high level of accessibility. However, this also leaves the system prone to under- or over-investment in terms of the number of bus routes, leading some stops and routes to be under-served or over-served.
The objective of this study is to examine the distribution of bus trips in Singapore by analyzing the number of trips by originating bus stops. It will consist of two levels of analysis:
- GeoVisualisation and Analysis: Visualizing the number of trips by originating bus stops and provide descriptive statistics of the distribution of trips by bus stops.
- Local Indicators of Spatial Association Analysis (LISA): This analysis involves the calculation of Local Moran’s I to determine local spatial autocorrelation between a bus stop and its neighbors. Additionally, visualizations such as a LISA cluster map will be created for easier comparison.
Getting Started
First, the necessary R packages will be loaded using the p_load() function of the pacman package. p_load() will also install any package which is not already installed. The following packages will be loaded:
sf: For handling of geospatial data.
sfdep: For determining the spatial dependence of spatial features. The three main categories of functionality relates to the determination of geometry neighbors, weights, and LISA.
tidyverse: For manipulation of non-spatial data. This package contains ggplot2 for plotting, dplyr and tidyr for dataframe manipulation, and readr for reading comma-separated values (CSV).
tmap: For thematic mapping, especially the mapping of simple features data frame.
Importing Required Data
For the purpose of this study, two types of data will be used: geospatial data which consists of spatial features and their coordinates information, and aspatial data which consists of attributes which can be ascribed to the geospatial data. Specifically, the following datasets will be used for each type:
- Geospatial Data:
- BusStop.shp: This shape file contains the location of the bus stops in Singapore as at July 2023. This file can be retrieved from the Land Transport Authority (LTA) Data Mall (link).
- Aspatial Data:
- origin_destination_bus_202309.csv: This CSV file contains the detail of bus trips from an originating bus stop to a destination bus stop, identified by their unique codes, each hour of the day during September 2023. The data is further broken down into weekend or weekday, but not by the specific day of the week. This data can be retrieved by using the LTA Data Mall’s API (link).
The first steps taken will be to import these files into the R environment in a manipulable format.
Importing Geospatial Data
Geospatial data can be imported using the st_read() function of the sf package. This will import the file into the R environment as a sf (simple features) data frame. st_transform() is added to transform the Coordinate Reference System (CRS) to EPSG: 3414, which is the CRS of Singapore.
In st_read():
dsn: the directory where the shape file is stored
layer: the name of the shape file
busstop <- st_read(dsn = 'data/geospatial',
layer = 'BusStop') %>%
st_transform(crs = 3414)Reading layer `BusStop' from data source
`D:\phlong2023\ISSS624\Take-Home_Ex\Take-Home_Ex1\data\geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21
From the message provided by R, it can be seen that the busstop sf data frame has 5161 rows, 3 columns, and has a CRS of SVY 21.
To get a better grasp of the busstop data frame, glimpse() function can be used.
The data type for each column can be seen as well as some of their values. For sf data frames, there is a geometry column (POINT type) which contains the location information for each polygon.
glimpse(busstop)Rows: 5,161
Columns: 4
$ BUS_STOP_N <chr> "22069", "32071", "44331", "96081", "11561", "66191", "2338…
$ BUS_ROOF_N <chr> "B06", "B23", "B01", "B05", "B05", "B03", "B02A", "B02", "B…
$ LOC_DESC <chr> "OPP CEVA LOGISTICS", "AFT TRACK 13", "BLK 239", "GRACE IND…
$ geometry <POINT [m]> POINT (13576.31 32883.65), POINT (13228.59 44206.38),…
Importing Aspatial Data
The read_csv() function of readr can be used to import the origin_destination_bus_202309 CSV file into the R environment as a data frame.
passenger <- read_csv('data/aspatial/origin_destination_bus_202309.csv')From the message provided by R, it can be seen that the passenger has 5,714,196 rows and 7 columns.
head() can be used instead of glimpse() to view the top five rows of the passenger data frame. This will also allow us to see the data type of each of the column.
head(passenger)# A tibble: 6 × 7
YEAR_MONTH DAY_TYPE TIME_PER_HOUR PT_TYPE ORIGIN_PT_CODE DESTINATION_PT_CODE
<chr> <chr> <dbl> <chr> <chr> <chr>
1 2023-09 WEEKENDS/… 17 BUS 24499 22221
2 2023-09 WEEKENDS/… 10 BUS 65239 65159
3 2023-09 WEEKDAY 10 BUS 65239 65159
4 2023-09 WEEKDAY 7 BUS 23519 23311
5 2023-09 WEEKENDS/… 7 BUS 23519 23311
6 2023-09 WEEKENDS/… 11 BUS 52509 42041
# ℹ 1 more variable: TOTAL_TRIPS <dbl>
Note that the ORIGIN_PT_CODE and DESTINATION_PT_CODE are in the character (“chr”) data type. However, we would like it to be in the factor (“fctr”) data type for easier categorization and sorting. This can be done by using the as.factor() function.
passenger$ORIGIN_PT_CODE <- as.factor(passenger$ORIGIN_PT_CODE)
passenger$DESTINATION_PT_CODE <- as.factor(passenger$DESTINATION_PT_CODE)We can use head() to check the data type of the passenger data frame.
head(passenger)# A tibble: 6 × 7
YEAR_MONTH DAY_TYPE TIME_PER_HOUR PT_TYPE ORIGIN_PT_CODE DESTINATION_PT_CODE
<chr> <chr> <dbl> <chr> <fct> <fct>
1 2023-09 WEEKENDS/… 17 BUS 24499 22221
2 2023-09 WEEKENDS/… 10 BUS 65239 65159
3 2023-09 WEEKDAY 10 BUS 65239 65159
4 2023-09 WEEKDAY 7 BUS 23519 23311
5 2023-09 WEEKENDS/… 7 BUS 23519 23311
6 2023-09 WEEKENDS/… 11 BUS 52509 42041
# ℹ 1 more variable: TOTAL_TRIPS <dbl>
Data Preparation
In order to perform our analysis, certain manipulations must be made in order to prepare the data. Specifically, the passenger data set will be filtered and summarzied. Subsequently, it will be combined with the busstop data set based on the bus stop code variable present in both data frames.
Wrangling Aspatial Data
Filtering the passenger Data Set for Desired Time Frames
For the purpose of this study, the passenger data set needs to be filtered to only contain trips falling within one of the following time frames:
| Peak hour period | Bus tap on time |
|---|---|
| Weekday morning peak | 6am to 9am |
| Weekday afternoon peak | 5pm to 8pm |
| Weekend/holiday morning peak | 11am to 2pm |
| Weekend/holiday evening peak | 4pm to 7pm |
This can be accomplished using the filter() function and the dplyr steps. We can create four separate data frames to store the four different time frames
# Weekday morning peak 6am - 9am
passenger_wd_69 <- passenger %>%
filter(DAY_TYPE == 'WEEKDAY') %>%
filter(TIME_PER_HOUR >= 6 & TIME_PER_HOUR <= 9)
# Weekday afternoon peak 5pm - 8pm (17:00 - 20:00)
passenger_wd_1720 <- passenger %>%
filter(DAY_TYPE == 'WEEKDAY') %>%
filter(TIME_PER_HOUR >= 17 & TIME_PER_HOUR <= 20)
# Weekend/holiday morning peak 11am - 2pm (11:00 - 14:00)
passenger_weh_1114 <- passenger %>%
filter(DAY_TYPE == 'WEEKENDS/HOLIDAY') %>%
filter(TIME_PER_HOUR >= 11 & TIME_PER_HOUR <= 14)
# Weekend/holiday evening peak 4pm - 7pm (16:00 - 19:00)
passenger_weh_1619 <- passenger %>%
filter(DAY_TYPE == 'WEEKENDS/HOLIDAY') %>%
filter(TIME_PER_HOUR >= 16 & TIME_PER_HOUR <= 19)After the different trips have been categorized into their separate data frames, the total number trips for each origin bus stop can be tallied into a single statistic for the study period. This can be accomplished using the summarize() function. The example below shows this operation using passenger_wd_69.
The group_by() function is used to instruct R to conduct operations based on the groups created by group_by(). In this case, the summary operations will be done based on the origin bus stop codes.
# Tallying the trips by origin bus stop for Weekday morning peak 6am - 9am
passenger_wd_69_tallied <- passenger_wd_69 %>%
group_by(ORIGIN_PT_CODE) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))
passenger_wd_69_tallied# A tibble: 5,020 × 2
ORIGIN_PT_CODE TRIPS
<fct> <dbl>
1 01012 1640
2 01013 764
3 01019 1322
4 01029 2373
5 01039 2562
6 01059 1582
7 01109 144
8 01112 7993
9 01113 6734
10 01119 3736
# ℹ 5,010 more rows
As can be seen, the newly created data frame consists only of the total trip numbers for each origin bus stop. This can be repeated for the other time frames.
# Tallying the trips by origin bus stop for Weekday afternoon peak 5pm - 8pm (17:00 - 20:00)
passenger_wd_1720_tallied <- passenger_wd_1720 %>%
group_by(ORIGIN_PT_CODE) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))
# Tallying the trips by origin bus stop for Weekend/holiday morning peak 11am - 2pm (11:00 - 14:00)
passenger_weh_1114_tallied <- passenger_weh_1114 %>%
group_by(ORIGIN_PT_CODE) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))
# Tallying the trips by origin bus stop for Weekend/holiday evening peak 4pm - 7pm (16:00 - 19:00)
passenger_weh_1619_tallied <- passenger_weh_1619 %>%
group_by(ORIGIN_PT_CODE) %>%
summarise(TRIPS = sum(TOTAL_TRIPS))Wrangling Geospatial Data
In order to adequately visualize the busstop sf data frame, we need to define a mapping layer. An example of a mapping layer would be to use the Master Plan 2019 Planning Sub-zone created by the Urban Redevelopment Authority (URA). However, for the purpose of this study, a hexagon layer will be used to ensure standardization of the size of each polygon and the evenly spaced gaps between a polygon and its neighbors.
The steps in this section will detail the creation of the hexagon layer using the busstop data frame and visualize the layer on a map of Singapore.
Creating a Hexagon Layer in R
The steps taken in this section is based on the guide provided by Kenneth Wong of Urban Data Palette (link).
Firstly, a hexagon or honeycomb grid can be created based on the busstop data frame using the st_make_grid() function.
There are some notable arguments in the st_make_grid() function:
cellsize = c(100,100): This argument indicates the size of each hexagon. If the cell size is large, each hexagon can encompasses multiple bus stops, whereas if a smaller cell size can help us differentiate between individual bus stop. However, a smaller cell size with many hexagons will take more time to create.
what = ‘polygons’: We would like to create polygons on a grid.
square = FALSE: The default argument is TRUE, which would create a square grid. FALSE is specified in order to create a hexagon grid.
area_honeycomb_grid = st_make_grid(busstop, cellsize = c(250,250), what = 'polygons', square = FALSE)
area_honeycomb_gridGeometry set for 22134 features
Geometry type: POLYGON
Dimension: XY
Bounding box: xmin: 3720.122 ymin: 26337.76 xmax: 48470.12 ymax: 53256.72
Projected CRS: SVY21 / Singapore TM
First 5 geometries:
The area_honeycomb_grid contains 136906 features of the same Projected CRS as the busstop data frame. If the plot() function is used, the hexagon grid will be displayed. However, this grid contains no information and might be too small to discern the individual cell.
#qtm(area_honeycomb_grid)The area_honey_comb needs to be converted to a sf data frame for further manipulation using st_sf(). Additionally, we can assign a unique id to each of the hexagon cell in area_honey_comb using mutate().
honeycomb_grid_sf = st_sf(area_honeycomb_grid) %>%
# add grid ID
mutate(grid_id = 1:length(lengths(area_honeycomb_grid)))Following this, we can use lengths() and st_intersect() to determine the allocation of bus stop in each cell. The goal is to create a new column, consisting of the number of bus stop in each of the cell. The filter() function can then be added to remove all cells with no bus stop and create the final sf data frame.
# Counting the number of bus stop in each cell
honeycomb_grid_sf$n_busstop = lengths(st_intersects(honeycomb_grid_sf,busstop))
# Removing all cells without bus stop
honeycomb_count = filter(honeycomb_grid_sf, n_busstop > 0)At this point, the hexagon grid of bus stop can be drawn onto a map of Singapore using the functions of the tmap package. Additionally, the n_busstop column can be passed to the tm_fill() function to shade the cell based on the number of bus stops in it.
Several functions are added to make the map interactive and aesthetically pleasing
tmap_mode(‘view’): Creates an interactive map which allow zooming and interacting with cells on the map
pop.vars: Identifying the legend and value which pops up when a cell is selected. In this case, it is the number of bus stops.
popup.format: Specifying the format of the variable to be displayed when selecting a cell.
tm_basemap: Choosing the basemap layer on which the hexagon grid will be drawn. OpenStreetMap is chosen due to its high fidelity while not being overly crowded. Additionally, OpenStreetMap displays icon for bus stops in Singapore, allowing user to visually check any cell.
- If an incorrect CRS was specified in the earlier steps, the basemap will be of an incorrect location or alignment.
tmap_mode('view')
bushexmap <- tm_shape(honeycomb_count)+
tm_fill(
col = "n_busstop",
palette = "Blues",
style = "cont",
title = "Number of bus stop",
id = "grid_id",
showNA = FALSE,
alpha = 0.7,
popup.vars = c(
"Number of bus stop: " = "n_busstop"
),
popup.format = list(
n_busstop = list(format = "f", digits = 0)
)
) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_basemap(server = c('Esri.WorldGrayCanvas'))
bushexmapBy looking at the illustration, we can see that each cell might contain up to four bus stops. A bar chat can be drawn with the ggplot2 package to visualize the distribution of number of bus stop in each cell.
ggplot(honeycomb_count, aes(x=n_busstop))+
geom_bar()+
theme_classic()+
geom_text(aes(label = ..count..), stat = "count", vjust = -0.5, colour = "black")
As can be seen, the majority of cells contain 1-2 bus stop with only 250 cells containing more than 3 bus stops. This shows that the cells adequately capture solitary bus stop, as well as pairs of bus stops (bus stops which are opposite each other, served by the same bus services).
Combining Aspatial and Geospatial Data
In order to conduct geospatial analysis, a data frame which contains the hexagon cells as well as the number of bus trips for each cells must be created. This can be done using the left_join argument.
There are important arguments which can be used to create a cleaner combined data frame.
by = join_by(BUS_STOP_N == ORIGIN_PT_CODE)): Indicate the column by which the two data frames can be matched and joined. In this case, the bus stop code will be used.
select(1,4,5): Indicate the index number of the columns to be kept in the final data frame. Only the bus stop number (column 1), total number of trips (column 4), and geometry (column 5) will be kept.
replace(is.na(.),0): Replace all value of NA with 0. This is to ensure that bus stop with no trips in a given time frame is accurately tallied at 0.
# Weekday morning peak 6am - 9am
passenger_wd_69_combined <- left_join(busstop, passenger_wd_69_tallied, by = join_by(BUS_STOP_N == ORIGIN_PT_CODE))%>%
select(1,4,5)%>%
replace(is.na(.),0)
# Weekday afternoon peak 5pm - 8pm (17:00 - 20:00)
passenger_wd_1720_combined <- left_join(busstop, passenger_wd_1720_tallied, by = join_by(BUS_STOP_N == ORIGIN_PT_CODE))%>%
select(1,4,5)%>%
replace(is.na(.),0)
# Weekend/holiday morning peak 11am - 2pm (11:00 - 14:00)
passenger_weh_1114_combined <- left_join(busstop, passenger_weh_1114_tallied, by = join_by(BUS_STOP_N == ORIGIN_PT_CODE))%>%
select(1,4,5)%>%
replace(is.na(.),0)
# Weekend/holiday evening peak 4pm - 7pm (16:00 - 19:00)
passenger_weh_1619_combined <- left_join(busstop, passenger_weh_1619_tallied, by = join_by(BUS_STOP_N == ORIGIN_PT_CODE))%>%
select(1,4,5)%>%
replace(is.na(.),0)It is important to note that the bus stops and their total trips have not been tallied into the hexagon cells. st_join() can be used to accomplish this for each time frame.
The by argument in st_join() can be passed the function st_within to specify that we would like to join the two data frames where the geometry in the latter is within the geometry of the former. In this case, it would mean that two rows will be joined where the bus stop lies within a particular polygon.
- The group_by() and summarise() functions here are used similarly to before, they sums up the total number of trips for all the bus stops in the hexagon, based on its grid_id, and create a new column called TOTAL_TRIP.
# Weekday morning peak 6am - 9am
hex_passenger_wd_69 <- st_join(honeycomb_count, passenger_wd_69_combined, by = st_within(sparse = FALSE), largest = TRUE) %>%
group_by(grid_id)%>%
summarise(TOTAL_TRIP = sum(TRIPS))
# Weekday afternoon peak 5pm - 8pm (17:00 - 20:00)
hex_passenger_wd_1720 <- st_join(honeycomb_count, passenger_wd_1720_combined, by = st_within(sparse = FALSE), largest = TRUE) %>%
group_by(grid_id)%>%
summarise(TOTAL_TRIP = sum(TRIPS))
# Weekend/holiday morning peak 11am - 2pm (11:00 - 14:00)
hex_passenger_weh_1114 <- st_join(honeycomb_count, passenger_weh_1114_combined, by = st_within(sparse = FALSE), largest = TRUE) %>%
group_by(grid_id)%>%
summarise(TOTAL_TRIP = sum(TRIPS))
# Weekend/holiday evening peak 4pm - 7pm (16:00 - 19:00)
hex_passenger_weh_1619 <- st_join(honeycomb_count, passenger_weh_1619_combined, by = st_within(sparse = FALSE), largest = TRUE) %>%
group_by(grid_id)%>%
summarise(TOTAL_TRIP = sum(TRIPS))Let’s try to map these hexagon grids to see if the visualization aligns with what we would expect.
tmap_mode('view')
# Weekday morning peak 6am - 9am
weekday_morning <- tm_shape(hex_passenger_wd_69) +
tm_fill(
col = "TOTAL_TRIP",
palette = "Blues",
style = "quantile",
title = "Number of Trips",
id = "grid_id",
showNA = FALSE,
alpha = 0.7,
popup.vars = c(
"Number of Trips: " = "TOTAL_TRIP"
),
popup.format = list(
TOTAL_TRIP = list(format = "f", digits = 0)
)
) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_basemap(server = c('Esri.WorldGrayCanvas'))+
tm_layout(title = 'Weekday Morning Peak', title.position = c('right', 'top'))
# Weekday afternoon peak 5pm - 8pm (17:00 - 20:00)
weekday_afternoon <- tm_shape(hex_passenger_wd_1720) +
tm_fill(
col = "TOTAL_TRIP",
palette = "Blues",
style = "quantile",
title = "Number of Trips",
id = "grid_id",
showNA = FALSE,
alpha = 0.7,
popup.vars = c(
"Number of Trips: " = "TOTAL_TRIP"
),
popup.format = list(
TOTAL_TRIP = list(format = "f", digits = 0)
)
) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_basemap(server = c('Esri.WorldGrayCanvas'))+
tm_layout(title = 'Weekday Afternoon Peak', title.position = c('right', 'top'))
# Weekend/holiday morning peak 11am - 2pm (11:00 - 14:00)
weekend_noon <- tm_shape(hex_passenger_weh_1114) +
tm_fill(
col = "TOTAL_TRIP",
palette = "Blues",
style = "quantile",
title = "Number of Trips",
id = "grid_id",
showNA = FALSE,
alpha = 0.7,
popup.vars = c(
"Number of Trips: " = "TOTAL_TRIP"
),
popup.format = list(
TOTAL_TRIP = list(format = "f", digits = 0)
)
) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_basemap(server = c('Esri.WorldGrayCanvas'))+
tm_layout(title = 'Weekend/holiday Morning Peak', title.position = c('right', 'top'))
# Weekend/holiday evening peak 4pm - 7pm (16:00 - 19:00)
weekend_evening <- tm_shape(hex_passenger_weh_1619) +
tm_fill(
col = "TOTAL_TRIP",
palette = "Blues",
style = "quantile",
title = "Number of Trips",
id = "grid_id",
showNA = FALSE,
alpha = 0.7,
popup.vars = c(
"Number of Trips: " = "TOTAL_TRIP"
),
popup.format = list(
TOTAL_TRIP = list(format = "f", digits = 0)
)
) +
tm_borders(col = "grey40", lwd = 0.7)+
tm_basemap(server = c('Esri.WorldGrayCanvas'))+
tm_layout(title = 'Weekend/holiday Evening Peak', title.position = c('right', 'top'))
tmap_arrange(weekday_morning, weekday_afternoon, weekend_noon, weekend_evening, nrow = 2, ncol = 2)Search for the bus stop near your place and compare the numbers between the four maps by zooming in! Do you think it’s accurate?
From the rough visualization, it’s clear that not all bus stops experience a similar level of traffic throughout different timing of the days. Additionally, based on the quantiles created by tmap, it seems that the ranges of passenger traffic are radically different between the four time windows. For example, certain grids during Weekday Morning Peak, a hexagon could reach 328,545 passenger trips, whereas the highest number during Weekend/holiday Morning Peak only reaches 112,330.
Cleanup Step
Before we move on to the analysis of LISA, it would be wise to remove objects which will no longer be used. This will help to free up memory for other tasks. rm() can be used to perform this task
rm(list = c('area_honeycomb_grid', 'bushexmap', 'honeycomb_count', 'honeycomb_grid_sf',
'passenger_wd_1720', 'passenger_wd_1720_tallied', 'passenger_wd_69', 'passenger_wd_69_tallied',
'passenger_weh_1114', 'passenger_weh_1114_tallied',
'passenger_weh_1619', 'passenger_weh_1619_tallied'))